Adders are crucial logic building blocks found in almost all modern electronic system designs.
In adder architecture design, the fundamental issue is the propagation latency of the carry chain.
As the length of the input operands increases, the length of the carry chain increases along with it.
Parallel prefix adders, which address the problem of carry propagation, are among the most efficient
adder topologies for hardware implementation. However, further delay reduction can still be achieved
for very high speed applications. Hence, in this paper, the design of a novel 16-bit parallel prefix
adder is proposed and compared against existing parallel prefix adder architectures. The design and
simulation are carried out using Xilinx Vivado for field-programmable gate array (FPGA) simulation and
Cadence® for ASIC. The results of the ASIC implementation demonstrate a 17.8% delay reduction compared
to the sparse Kogge-Stone adder.

Keywords: Kogge-Stone; Parallel prefix; Sparse adder

This is an open access article under the CC BY-SA license.
1. INTRODUCTION

Very large scale integration (VLSI) system designs comprise subsystems that are split into various
modules. One such subsystem is the data path unit. Within a microprocessor, the data path elements are the
operational units that perform computations based on the instruction. These tasks include memory access,
arithmetic operations such as addition and multiplication, and logical operations. Although performance
optimization can be handled at any design level, it has the greatest influence at a higher level of abstraction,
the algorithmic level [1]. In practically all modern processing units, addition is a time-critical operation.
The choice of adder for applications such as digital signal processors (DSP) is determined by performance
criteria such as area, adder delay, and power dissipation. The primary purpose of any design is to reduce
power consumption while enhancing performance. To achieve performance and limit power dissipation, the
adder topology must be deliberately crafted.

The ripple carry adder calculates the carry bit along with the sum bit, and each bit position must wait
until the previous carry has been computed before it can calculate its own sum and carry bits [2], [3]. A
carry-lookahead adder boosts performance by minimizing the time it takes to determine the carry bits. The
parallel prefix adder (PPA) uses the three-step structure of the carry-lookahead adder [4], with the carry
generation stage enhanced so that it can be parallelized to reduce delay [5]. The outcome of the prefix
operation depends only on the initial inputs, so the operation can be segmented into smaller chunks that are
computed in parallel, which gives the parallel prefix adder its name. This opens various research avenues for
designing parallel prefix adder architectures with reduced power consumption and increased speed of operation.
Owing to the wide range of design options for trading off area, power, and delay, parallel prefix adders are a
good choice [6].
2. METHOD
2.1. Existing parallel prefix adders 

The parallel prefix adder has three stages, as depicted in Figure 1. The first is the pre-processing stage,
where the propagate (Pi) and generate (Gi) signals are produced. Given the inputs A and B, generate and
propagate are calculated as in (1) and (2). Gi indicates whether a carry is generated at bit i, and Pi
indicates whether an incoming carry is propagated through bit i.


[Figure 1: block diagram. Inputs A, B, Cin -> pre-processing stage -> carry generation stage -> post-processing stage -> Sum]

Figure 1. Block diagram of parallel prefix adders


$G_i = A_i \text{ AND } B_i$ (1)

$P_i = A_i \text{ XOR } B_i$ (2)
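
As an illustration of the pre-processing stage, the following minimal Python sketch evaluates (1) and (2) bit by bit. The function name `preprocess`, the `width` parameter, and the choice of index 0 for the least significant bit are assumptions made for this sketch; they do not come from the paper.

```python
def preprocess(a: int, b: int, width: int = 16):
    """Per-bit generate and propagate signals, following (1) and (2).

    Index 0 holds the least significant bit; this indexing is a choice
    made for the sketch, not something fixed by the paper.
    """
    a_bits = [(a >> i) & 1 for i in range(width)]
    b_bits = [(b >> i) & 1 for i in range(width)]
    g = [x & y for x, y in zip(a_bits, b_bits)]  # (1): Gi = Ai AND Bi
    p = [x ^ y for x, y in zip(a_bits, b_bits)]  # (2): Pi = Ai XOR Bi
    return g, p


g, p = preprocess(0b1011, 0b0110, width=4)
print(g)  # [0, 1, 0, 0]
print(p)  # [1, 0, 1, 1]
```

In hardware all bit positions are evaluated at once, which is why this stage contributes a single gate level regardless of operand width.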

A prefix-graph-based tree structure is used for the PPA carry generation stage [7]. Pairs of generate
and propagate signals (Gx, Px), (Gy, Py) are fed into the second stage (the carry generation stage), which
calculates the group generate and group propagate signals (Gx:y, Px:y), as shown in Figure 2 and as
given in (3) and (4) [8].


[Figure 2: prefix operator cell. Inputs (Gx, Px) and (Gy, Py); output (Gx:y, Px:y)]

Figure 2. Calculation of carry bit in a prefix graph
The final cell in each bit position provides the carry bit for that position. Each carry bit is then used
in the summation of the next bit, and so on up to the last bit. The group generate and group propagate
signals are given in (3) and (4) [9].
$G_{x:y} = G_x + (P_x \cdot G_y)$ (3)

$P_{x:y} = P_x \cdot P_y$ (4)
In the post-processing stage, the sum is calculated as given in (5).


$S_i = P_i \text{ XOR } G_{i-1:cin}$ (5)
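
The three stages can be tied together in a small behavioural model. The sketch below applies (1) and (2), chains the operator of (3) and (4) serially from the carry-in upwards, and forms the sum with (5). The names `dot` and `prefix_add` are hypothetical, bits are 0-indexed, and the serial scan is used only as a reference; the adders discussed in this paper arrange the same operator in a tree.

```python
def dot(hi, lo):
    """Prefix operator of (3) and (4): merge (Gx, Px) with the lower group (Gy, Py)."""
    gx, px = hi
    gy, py = lo
    return gx | (px & gy), px & py


def prefix_add(a, b, cin=0, width=16):
    # Pre-processing, (1) and (2).
    g = [(a >> i) & (b >> i) & 1 for i in range(width)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]

    # Carry generation as a serial scan. carries[i] is the carry INTO bit i.
    # cin is modelled as a group that generates cin and propagates nothing.
    carries = [cin]
    group = (cin, 0)
    for i in range(width):
        group = dot((g[i], p[i]), group)
        carries.append(group[0])

    # Post-processing, (5): sum bit = propagate XOR incoming carry.
    s = 0
    for i in range(width):
        s |= (p[i] ^ carries[i]) << i
    return s, carries[width]


print(prefix_add(0xFFFF, 0x0001))  # (0, 1)
```

Because the operator is associative, the order in which groups are merged can be rearranged freely, which is the property every tree topology in the following subsections relies on.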


The critical path in parallel adders is determined by the carry travelling from the least significant bit
to the most significant bit (MSB); hence, reducing the path the carry takes to reach the MSB is necessary [10].


2.1.1. Brent-Kung adder

A Brent-Kung adder is a parallel adder with a layout intended to save chip area and ease
manufacturing. Its symmetric, regular structure greatly decreases production costs and allows it to be
employed in pipelined topologies. With a chip area of O(n log n), the addition of n-bit numbers can be done
in O(log n) time, making it a good choice when performance must be maximized under area limitations [11].

Figure 3 depicts the Brent-Kung parallel prefix adder, which consists of a pre-processing stage (stage 1),
carry generation stages (stages 2-7), and a post-processing stage (stage 8). It is one of the advanced adder
architectures, is simpler in construction, and offers significantly less wiring congestion. It has the lowest
number of wiring tracks, which reduces the area needed to implement the architecture. Furthermore, because
fewer wires cross or overlap each other, routing becomes much easier. The penalty is increased delay due to
the larger number of stages. In addition, the fan-out of this adder increases, which further increases the
delay [12].
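
A behavioural sketch of the Brent-Kung carry tree is given below, assuming the textbook schedule (an up-sweep that doubles the merge distance, followed by a down-sweep that fills in the remaining prefixes). It is not a cell-for-cell copy of Figure 3, whose stage numbering may differ. The names `dot` and `brent_kung_add`, the 0-indexed bits, and the way the carry-in is folded into bit 0 are assumptions of this sketch; `width` must be a power of two of at least 4.

```python
def dot(hi, lo):
    """Prefix operator of (3) and (4)."""
    return hi[0] | (hi[1] & lo[0]), hi[1] & lo[1]


def brent_kung_add(a, b, cin=0, width=16):
    g = [(a >> i) & (b >> i) & 1 for i in range(width)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]
    gp = list(zip(g, p))
    gp[0] = (g[0] | (p[0] & cin), p[0])  # fold cin into the bit-0 generate
    ops = 0                               # number of prefix operators used

    d = 1
    while d < width:                      # up-sweep: distances 1, 2, 4, ...
        for i in range(2 * d - 1, width, 2 * d):
            gp[i] = dot(gp[i], gp[i - d])
            ops += 1
        d *= 2

    d = width // 4
    while d >= 1:                         # down-sweep: fill in missing prefixes
        for i in range(3 * d - 1, width, 2 * d):
            gp[i] = dot(gp[i], gp[i - d])
            ops += 1
        d //= 2

    carry_in = [cin] + [gp[i][0] for i in range(width - 1)]
    s = sum((p[i] ^ carry_in[i]) << i for i in range(width))
    return s, gp[width - 1][0], ops


print(brent_kung_add(12345, 54321))  # (1130, 1, 26)
```

For 16 bits this schedule uses 26 prefix operators, which illustrates the small area of the topology; the price is the larger number of levels noted in the text.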
Figure 3. Architecture of Brent-Kung adder


2.1.2. Kogge-Stone adder

The Kogge-Stone adder is a fast adder that produces carry signals in O(log n) time [13]. The 16-bit
adder is implemented in four layers and hence incurs less delay. Its high speed of operation is due to its
minimal logic depth and low fan-out. Because it has the lowest fan-out of the approaches considered, it has
the shortest latency, making it well suited to high-speed industrial applications [14].

As shown in Figure 4, the Kogge-Stone parallel prefix adder consists of a pre-processing stage (stage 1),
carry generation stages (stages 2, 3, 4), and a post-processing stage (stage 5). Compared with other
algorithms, its low fan-out lowers delay, making it well suited to fast commercial applications. However,
Kogge-Stone adders take up considerable area and have a large amount of overlapping wiring [15]: the number
of dot operators grows rapidly with even a small rise in the number of bits. The carry generation stage
provides the carry corresponding to each bit, and these operations on bits are performed in parallel. This
block sets the Kogge-Stone adder (KSA) apart from other adders and is responsible for its outstanding
performance [16].
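
The Kogge-Stone recurrence can be sketched in a few lines of Python, assuming the textbook formulation in which every bit merges with the group a distance d below it and d doubles at each level. The names `dot` and `kogge_stone_add`, the 0-indexed bits, and the folding of the carry-in into bit 0 are assumptions of this sketch rather than details taken from Figure 4.

```python
def dot(hi, lo):
    """Prefix operator of (3) and (4)."""
    return hi[0] | (hi[1] & lo[0]), hi[1] & lo[1]


def kogge_stone_add(a, b, cin=0, width=16):
    g = [(a >> i) & (b >> i) & 1 for i in range(width)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]
    gp = list(zip(g, p))
    gp[0] = (g[0] | (p[0] & cin), p[0])  # fold cin into the bit-0 generate
    levels = 0
    ops = 0

    d = 1
    while d < width:
        nxt = list(gp)                   # every cell of a level reads the previous level
        for i in range(d, width):
            nxt[i] = dot(gp[i], gp[i - d])
            ops += 1
        gp = nxt
        d *= 2
        levels += 1

    carry_in = [cin] + [gp[i][0] for i in range(width - 1)]
    s = sum((p[i] ^ carry_in[i]) << i for i in range(width))
    return s, gp[width - 1][0], levels, ops


print(kogge_stone_add(12345, 54321))  # (1130, 1, 4, 49)
```

For 16 bits the model needs four prefix levels, in line with the four layers mentioned above, but 49 prefix operators, which reflects the area and wiring cost of the topology.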
Figure 4. Architecture of Kogge-Stone adder


2.1.3. Han-Carlson adder

The Han-Carlson adder combines the Brent-Kung and Kogge-Stone principles [17]. Carry-merge
operations are performed only on even bits. Odd-bit generate and propagate signals are forwarded downwards,
as depicted in Figure 5, and at the end they recombine with the even-bit carry signals to generate the true
carry bits. In comparison to Kogge-Stone, it performs better for adders with fewer bits. This adder offers
high speed with comparatively small area and power usage. It is simple to create, with a well-balanced
maximum fan-out and number of logic levels [18]. As shown in Figure 5, the Han-Carlson parallel prefix adder
consists of a pre-processing stage (stage 1), carry generation stages (stages 2-5), and a post-processing
stage (stage 6).
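
The hybrid nature of the Han-Carlson tree can be sketched as follows, assuming the textbook schedule: one Brent-Kung style level, Kogge-Stone levels restricted to every second bit, and one final level that completes the remaining bits. The sketch numbers bits from 0, so the positions that take part in the merge levels carry odd indices here, whereas the text counts them as even bits. The names `dot` and `han_carlson_add` and the folding of the carry-in into bit 0 are assumptions; the level count of this schedule need not match the stage numbering of Figure 5.

```python
def dot(hi, lo):
    """Prefix operator of (3) and (4)."""
    return hi[0] | (hi[1] & lo[0]), hi[1] & lo[1]


def han_carlson_add(a, b, cin=0, width=16):
    g = [(a >> i) & (b >> i) & 1 for i in range(width)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]
    gp = list(zip(g, p))
    gp[0] = (g[0] | (p[0] & cin), p[0])  # fold cin into the bit-0 generate
    ops = 0

    # First level (Brent-Kung style): each odd index absorbs its lower neighbour.
    for i in range(1, width, 2):
        gp[i] = dot(gp[i], gp[i - 1])
        ops += 1

    # Kogge-Stone levels on odd indices only, distances 2, 4, 8, ...
    d = 2
    while d < width:
        nxt = list(gp)
        for i in range(d + 1, width, 2):
            nxt[i] = dot(gp[i], gp[i - d])
            ops += 1
        gp = nxt
        d *= 2

    # Last level: the skipped indices pick up the finished prefix just below them.
    for i in range(2, width, 2):
        gp[i] = dot(gp[i], gp[i - 1])
        ops += 1

    carry_in = [cin] + [gp[i][0] for i in range(width - 1)]
    s = sum((p[i] ^ carry_in[i]) << i for i in range(width))
    return s, gp[width - 1][0], ops


print(han_carlson_add(12345, 54321))  # (1130, 1, 32)
```

With 32 prefix operators for 16 bits, the model sits between a full Kogge-Stone tree and a Brent-Kung tree in operator count, which is the trade-off the text describes.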
Figure 5. Architecture of Han-Carlson adder


2.1.4. Sparse-4 Kogge-Stone adder

The tree structure of the 16-bit sparse Kogge-Stone adder is similar to that of the Kogge-Stone adder,
except that only every fourth carry bit is generated and the rest of the tree is skipped. The term
"adder sparsity" refers to the number of carry bits produced by the carry tree. Since every fourth carry bit
is generated in this sparse Kogge-Stone adder, it is called the sparse-4 Kogge-Stone adder. It ends with
ripple carry adders that provide the final sum [19]-[21]. In Figure 6, the black cells calculate both
generate and propagate, whereas the grey cells calculate only the generate bit.

As Figure 6 shows, stage 1 is the pre-processing stage, where generate and propagate are
calculated using (1) and (2), respectively. Stages 2 to 5 are the carry generation stages, which make use of
(3) and (4). The final computation stage provides the sum with the help of 16 ripple carry adders. The sparse
Kogge-Stone adder reduces the number of wiring junctions required for implementation without sacrificing
much processing speed. It also improves the adder's power and area consumption [22], [23].
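
The sparse-4 idea can be modelled behaviourally as shown below: the tree produces only the carries entering each 4-bit group, and a short ripple inside every group forms the sum bits. This is a simplified sketch under stated assumptions, not a cell-for-cell reproduction of Figure 6. The names `dot` and `sparse4_add`, the 0-indexed bits, and the way the 4-bit group signals are formed by chaining the operator are all choices made for the illustration.

```python
def dot(hi, lo):
    """Prefix operator of (3) and (4)."""
    return hi[0] | (hi[1] & lo[0]), hi[1] & lo[1]


def sparse4_add(a, b, cin=0, width=16):
    g = [(a >> i) & (b >> i) & 1 for i in range(width)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]

    # Group (G, P) for every 4-bit block.
    blocks = []
    for base in range(0, width, 4):
        grp = (g[base], p[base])
        for i in range(base + 1, base + 4):
            grp = dot((g[i], p[i]), grp)
        blocks.append(grp)
    blocks[0] = (blocks[0][0] | (blocks[0][1] & cin), blocks[0][1])  # fold cin

    # Kogge-Stone prefix over the blocks only: this yields every fourth carry.
    d = 1
    while d < len(blocks):
        nxt = list(blocks)
        for k in range(d, len(blocks)):
            nxt[k] = dot(blocks[k], blocks[k - d])
        blocks = nxt
        d *= 2

    # Ripple inside each block, seeded with the carry from the tree.
    group_cin = [cin] + [blk[0] for blk in blocks[:-1]]
    s = 0
    for k, base in enumerate(range(0, width, 4)):
        c = group_cin[k]
        for i in range(base, base + 4):
            s |= (p[i] ^ c) << i
            c = g[i] | (p[i] & c)
    return s, blocks[-1][0]


print(sparse4_add(12345, 54321))  # (1130, 1)
```

The tree now spans four block positions instead of sixteen bit positions, which is where the savings in cells and wiring come from; the ripple inside each block is the source of the extra delay.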
Figure 6. Tree structure of sparse-4 Kogge-Stone adder


2.2. Proposed delay optimized sparse-4 Kogge-Stone adder

The sparse-4 Kogge-Stone adder demonstrated significant potential to improve area and power
consumption. The goal here was to attain performance characteristics in between those of the Kogge-Stone and
Han-Carlson adders: the Kogge-Stone adder's high speed could be partially sacrificed for better power, area,
and wire congestion [24]. This is achieved by improving the speed of the sparse-4 Kogge-Stone adder, which
already exhibits low area and power consumption. The number of stages in a parallel prefix adder directly
influences its speed. Therefore, as seen in Figure 7, the first stage of the sparse-4 Kogge-Stone adder was
removed and black cells 1, 2, and 3 were calculated directly. A black cell calculates both generate and
propagate, whereas a grey cell calculates only generate. The generate and propagate equations for black
cell 1 are:


$G_{4:1} = G_4 + P_4 \cdot G_3 + P_4 \cdot P_3 \cdot G_2 + P_4 \cdot P_3 \cdot P_2 \cdot G_1$ (6)

$P_{4:1} = P_4 \cdot P_3 \cdot P_2 \cdot P_1$ (7)

The other two black cells in this stage are calculated similarly, and each has a fan-in of four. The
third stage of the sparse-4 Kogge-Stone adder was also removed, except for grey cell 1. Grey cell 2 was
calculated from black cell 2 and grey cell 1, with a fan-in of two, instead of from black cell 2, black
cell 1, and cin (as done in the original sparse-4 KSA), which would have had a fan-in of three. The equation
for grey cell 2 is:

$G_{8:cin} = G_{8:5} + (P_{8:5} \cdot G_{4:cin})$ (8)

The equation for grey cell 3 is:

$G_{12:cin} = G_{12:9} + P_{12:9} \cdot G_{8:5} + P_{12:9} \cdot P_{8:5} \cdot G_{4:cin}$ (9)
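
The flattened expressions (6) to (9) can be checked against repeated application of the two-input operator of (3) and (4). The following sketch does this exhaustively over all input combinations; the function names `dot` and `check_flattened_equations` and the variable names are hypothetical and serve only this check.

```python
from itertools import product


def dot(hi, lo):
    """Prefix operator of (3) and (4)."""
    return hi[0] | (hi[1] & lo[0]), hi[1] & lo[1]


def check_flattened_equations():
    # (6) and (7): black cell 1 computed in one step from bits 4..1.
    for g4, g3, g2, g1, p4, p3, p2, p1 in product((0, 1), repeat=8):
        nested = dot((g4, p4), dot((g3, p3), dot((g2, p2), (g1, p1))))
        flat_g = g4 | (p4 & g3) | (p4 & p3 & g2) | (p4 & p3 & p2 & g1)
        flat_p = p4 & p3 & p2 & p1
        if nested != (flat_g, flat_p):
            return False

    # (8) and (9): grey cells 2 and 3 built from the block signals.
    for g12_9, p12_9, g8_5, p8_5, g4_cin in product((0, 1), repeat=5):
        g8_cin = g8_5 | (p8_5 & g4_cin)                                  # (8)
        g12_cin = g12_9 | (p12_9 & g8_5) | (p12_9 & p8_5 & g4_cin)       # (9)
        if g12_cin != (g12_9 | (p12_9 & g8_cin)):
            return False
    return True


print(check_flattened_equations())  # True
```

The check confirms that the wider cells compute the same group signals as a chain of two-input cells, so removing a stage trades logic levels for higher fan-in without changing the result.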
Figure 7. Structure of proposed delay optimized sparse-4 Kogge-Stone adder


Calculating the 16th sum bit was problematic: being at the end of both the carry chain and the ripple
carry adders, it had the highest delay. To tackle this, grey cell 4 was implemented, which directly
calculates the carry required for sum bit 16. It is calculated after stage 4 using grey cell 3 and the
generate and propagate signals from the pre-processing stage. The equation for grey cell 4 is:


$G_{15:cin} = G_{15} + P_{15} \cdot G_{14} + P_{15} \cdot P_{14} \cdot G_{13} + P_{15} \cdot P_{14} \cdot P_{13} \cdot G_{12:cin}$ (10)


After this carry is obtained, the P16 required for the sum calculation in (5) is calculated in the yellow
box. P16 and the carry given by (10) are used to calculate sum bit 16 directly, which reduces delay significantly.
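
The direct computation of sum bit 16 can be illustrated with a short Python check. The sketch numbers bits from 1 to 16 as in (6) to (10), obtains G12:cin with a plain serial reference scan (an assumption made for brevity, standing in for grey cell 3), applies (10), and then applies (5). The function name `sum16_direct` is hypothetical.

```python
def sum16_direct(a, b, cin=0):
    # 1-indexed per-bit signals: g[1] and p[1] belong to the least significant bit.
    g = [0] + [(a >> i) & (b >> i) & 1 for i in range(16)]
    p = [0] + [((a >> i) ^ (b >> i)) & 1 for i in range(16)]

    # Reference value of G12:cin, computed serially for this sketch only.
    g12_cin = cin
    for i in range(1, 13):
        g12_cin = g[i] | (p[i] & g12_cin)

    # Grey cell 4, equation (10).
    g15_cin = (g[15]
               | (p[15] & g[14])
               | (p[15] & p[14] & g[13])
               | (p[15] & p[14] & p[13] & g12_cin))

    # Equation (5) for i = 16.
    return p[16] ^ g15_cin


a, b = 0x7FFF, 0x0001
print(sum16_direct(a, b), ((a + b) >> 15) & 1)  # 1 1
```

The value returned matches bit 16 of the ordinary integer sum, which indicates that (10) delivers the same carry that the last ripple carry adder would otherwise have to wait for.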


3. RESULTS AND DISCUSSION 

The parallel prefix adders described above were simulated in Xilinx Vivado, targeting the Spartan-7
field-programmable gate array (FPGA) xc7s50fgga484-2. Power estimation used the tool's default settings. The
outputs of the simulated 16-bit adders, obtained using ModelSim, are shown in Figures 8-12.
Figure 8. FPGA simulation result of sparse-4 Kogge-Stone adder
Figure 9. FPGA simulation result of proposed delay optimized sparse-4 Kogge-Stone adder
Figure 10. FPGA simulation result of Han-Carlson adder
Figure 11. FPGA simulation result of Kogge-Stone adder
Figure 12. FPGA simulation result of Kogge-Stone adder


The results in Table 1 show that, in the FPGA implementation, Kogge-Stone is the fastest of the
Kogge-Stone, Han-Carlson, and Brent-Kung adders: it is 13.3% faster than Han-Carlson and 13.1% faster than
Brent-Kung. However, it consumes 55.5% more area than Han-Carlson and 44.8% more than Brent-Kung. It also
consumes 44.3% more power than the other two adders. The FPGA implementation of the sparse-4 Kogge-Stone
adder is actually faster than the Kogge-Stone adder, although it is expected to be slower. This is likely
due to the impact of routing overhead [25]. The proposed sparse-4 Kogge-Stone adder is faster than the
sparse-4 Kogge-Stone adder by 5.4% while consuming the same area and power. The ASIC implementation results
are presented after Table 1.


Table 1. FPGA results of 16-bit parallel prefix adders

Parallel prefix adder            Slice LUTs   Logic power (mW)   Total delay (ns)
Kogge-Stone                      42           257                10.501
Han-Carlson                      27           178                12.122
Brent-Kung                       29           178                12.078
Sparse-4 Kogge-Stone             28           168                10.123
Proposed sparse-4 Kogge-Stone    28           168                9.570
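
The percentages quoted for the FPGA implementation can be recomputed from Table 1. The sketch below assumes two conventions that the text does not spell out: "x% faster" is read as the delay reduction relative to the slower design, and "x% more" is read relative to the smaller value. The dictionary and function names are hypothetical.

```python
# (slice LUTs, logic power in mW, total delay in ns), copied from Table 1.
TABLE_1 = {
    "Kogge-Stone": (42, 257, 10.501),
    "Han-Carlson": (27, 178, 12.122),
    "Brent-Kung": (29, 178, 12.078),
    "Sparse-4 Kogge-Stone": (28, 168, 10.123),
    "Proposed sparse-4 Kogge-Stone": (28, 168, 9.570),
}


def percent_faster(fast, slow):
    """Delay reduction of `fast` relative to `slow`, in percent (assumed convention)."""
    d_fast, d_slow = TABLE_1[fast][2], TABLE_1[slow][2]
    return (d_slow - d_fast) / d_slow * 100


def percent_more(big, small, column):
    """Increase of `big` over `small` in the given column (0 = LUTs, 1 = power)."""
    v_big, v_small = TABLE_1[big][column], TABLE_1[small][column]
    return (v_big - v_small) / v_small * 100


for other in ("Han-Carlson", "Brent-Kung"):
    print(other,
          round(percent_faster("Kogge-Stone", other), 2),
          round(percent_more("Kogge-Stone", other, 0), 2),
          round(percent_more("Kogge-Stone", other, 1), 2))
print(round(percent_faster("Proposed sparse-4 Kogge-Stone", "Sparse-4 Kogge-Stone"), 2))
```

Under these conventions the recomputed values agree with the quoted figures to within about 0.1 percentage point; the small differences suggest that the quoted values were truncated rather than rounded.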


The ASIC circuits were also built and analyzed using Cadence® Genus with 180 nm technology. The
results are shown in Figures 13-15. Kogge-Stone is faster than Han-Carlson by 49% and faster than Brent-Kung
by 50.9%. However, it consumes 54.3% more area than Han-Carlson and 50.7% more than Brent-Kung. It also
consumes 31.5% more power than Han-Carlson and 28.2% more than Brent-Kung. As expected, the sparse
Kogge-Stone adder is slower than the Kogge-Stone adder by 87.5% while consuming 26.5% less area and 8.8% less
power. The proposed delay optimized sparse Kogge-Stone adder is faster than the sparse Kogge-Stone adder by
17.8% and slower than the Kogge-Stone adder by 54%, while consuming 2.2% less power than the sparse
Kogge-Stone adder and 10.7% less power than the Kogge-Stone adder. It also consumes 9.7% more area than the
sparse Kogge-Stone adder and 19.3% less area than the Kogge-Stone adder. The performance of the proposed
delay optimized sparse-4 Kogge-Stone adder therefore lies between that of the Kogge-Stone and Han-Carlson
adders.
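
The ASIC percentages can be cross-checked against each other by normalising every quantity to the Kogge-Stone adder. The sketch below assumes that "slower by x%" means a delay ratio of 1 + x/100, that "faster by 17.8%" means a 17.8% delay reduction (as stated in the abstract), and that "less" percentages are taken relative to the larger value. The function name and dictionary keys are hypothetical.

```python
def relative_to_kogge_stone():
    """Derive the proposed adder's figures from the sparse adder's figures."""
    sparse_delay = 1 + 0.875                      # sparse is 87.5% slower than Kogge-Stone
    proposed_delay = sparse_delay * (1 - 0.178)   # 17.8% delay reduction

    sparse_power = 1 - 0.088                      # sparse uses 8.8% less power
    proposed_power = sparse_power * (1 - 0.022)   # a further 2.2% less

    sparse_area = 1 - 0.265                       # sparse uses 26.5% less area
    proposed_area = 1 - 0.193                     # proposed uses 19.3% less area

    return {
        "delay_slower_than_ks_pct": (proposed_delay - 1) * 100,
        "power_less_than_ks_pct": (1 - proposed_power) * 100,
        "area_vs_sparse_pct": (proposed_area / sparse_area - 1) * 100,
    }


for name, value in relative_to_kogge_stone().items():
    print(name, round(value, 1))
```

The derived values come out at roughly 54%, 10.8%, and +9.8%, close to the 54%, 10.7%, and 9.7% quoted in the text. The positive sign of the last value indicates that the 9.7% area figure relative to the sparse Kogge-Stone adder is an increase.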
Figure 13. Comparison of 16-bit adder area - ASIC implementation
4. CONCLUSION

The performance of various 16-bit parallel prefix adders, namely the Han-Carlson, Brent-Kung, and
Kogge-Stone adders, was analyzed using an FPGA platform and ASIC synthesis. ASIC synthesis was performed
using TSMC 180 nm technology. The ASIC results show that the Kogge-Stone adder is the fastest adder, at the
cost of increased power and area consumption. The Han-Carlson adder occupied less area than the Brent-Kung
adder but was slower in the FPGA implementation. The proposed delay optimized sparse-4 Kogge-Stone adder
demonstrates a 17.8% increase in speed over the sparse-4 Kogge-Stone adder and a 10.7% power reduction
compared with the Kogge-Stone adder. It would be worthwhile to implement custom adders using sparsity-2 and
to apply sparsity to other adders.